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Abstract 

We examine the problem of approximating a positive, semidefinite matrix £ by 
a dyad xx T , with a penalty on the cardinality of the vector x. This problem arises 
in sparse principal component analysis, where a decomposition of £ involving sparse 
factors is sought. We express this hard, combinatorial problem as a maximum eigen- 
value problem, in which we seek to maximize, over a box, the largest eigenvalue of 
a symmetric matrix that is linear in the variables. This representation allows to use 
the techniques of robust optimization, to derive a bound based on semidefinite pro- 
gramming. The quality of the bound is investigated using a technique inspired by 
Nemirovski and Ben-Tal (2002). 
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Notation 

The notation 1 denotes the vector of ones (with size inferred from context), while Card(x) 
denotes the cardinality of a vector x (number of non-zero elements), and D(x) the diagonal 
matrix with the elements of x on its diagonal. We denote by the unit vectors of W 1 . 
For a n x n matrix X, X y means X is symmetric and positive semi-definite. The 
notation B + , for a symmetric matrix B, denotes the matrix obtained from B by replacing 
negative eigenvalues by 0. The notation has precedence over the trace operator, so that 
Tr B + denotes the sum of positive eigenvalues of B if any, and otherwise. Throughout, 
the symbol E refers to expectations taken with respect to the normal Gaussian distribution 
of dimension inferred from context. Finally, the support of a vector x is defined to be the 
set of indices corresponding to its non-zero elements. 

1 Introduction 

Given a non-zero n x n positive semi-definite symmetric matrix E and a scalar p > 0, we 
consider the cardinality-penalized variational problem 

4>{p) '■= max x t TjX — p Card(x) : ||x||2 = 1. (1) 

X 

This problem is equivalent to solving the sparse rank-one approximation problem 

min ||E — zz T \\ 2 F — pCard(2;), 

z 

which arises in the sparse PC A problem j3J Ej, where a "decomposition" of E into sparse 
factors is sought. We refer to [2] for a motivation of the sparse PCA problem, and an 
overview of its many applications. 

In the paper j2], the authors have developed the "direct sparse PCA" approach, which 
leads to the following convex relaxation for the problem Jl}: 

max TrXE - p\\X\\ x : X y 0, TrX = 1. 

x 

The above problem is amenable to both general-purpose semidefinite programming (SDP) 
interior-point codes, and more recent first-order algorithms such as Nesterov's smooth min- 
imization technique Unfortunately, the quality of the relaxation seems to be hard to 
analyze at present. 

In this paper, we introduce two new representations of the problem, and a new SDP 
bound, based on robust optimization ideas pDj. Our main goal is to use the new representa- 
tions of the problem to analyze the quality of the corresponding bound. 

The paper is organized as follows. Section |2 develops some preliminary results allowing 
to restrict our attention to the case when p < maxj Section El then proposes two new 
representations for 4>(p), one based on largest eigenvalue maximization, and the other on a 
thresholded version of the Rayleigh quotient. In section 01 we derive an SDP-based upper 
bound on 4>(p), and in sectional we analyze its quality: as a function of the penalty parameter 
p first, then in terms of structural conditions on matrix E. 
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It will be helpful to describe E in terms of the Cholesky factorization E = A T A, where 
A = [ai . . . a n ], with a, G R m , % — 1, . . . , n, where m = Rank(E). Further, we will assume, 
without loss of generality, that the diagonal of E is ordered, and none of the diagonal elements 
is zero, so that En > . . . > E n „ > 0. Finally, we define the set Z(p) := {i : Ejj > p}, and 
let n(p) := Card J(p). 



2 Equality vs. Inequality Models 

In the sequel we will develop SDP bounds for the related quantity 

(j)(p) : = max x T Ex — p Card(x) : \\x\\2 < 1. (2) 

X 

The following theorem says that when p < En, the two quantities 0(p), 0(p) are positive 
and equal; otherwise, both <p(p) and 0(p) have trivial solutions. 

Theorem 1 If p < En, we /jave (/>(p) = 0(p) > ; and £ae optimal sets of problems (QJ) 
and (HJ) are tae same. Conversely, if p > En, we have 0(p) = > 0(p) = En — p, and 
a corresponding optimal vector for 0(p) (resp. <fi{p)) is x = e\, the first basis vector in M n 
(resp. x = 0). 

Proof: If p < En, then the choice x = e\ in (J2J) implies <ft(p) > 0, which in turn implies 
that an optimal solution x* for (j2J) is not zero. Since the Card function is scale-invariant, it 
is easy to show that without loss of generality, we can assume that x* has ^-norm equal to 
one, which then results in <fr(p) = <ft(p) > 0. 

Let us now turn to the case when p > E n . We develop an expression for 0(p) as follows. 
First observe that, since E y 0, 

max x t TjX = En, 
|x||i=i 

which implies that, for every x, 

EnlkHi > x T Ex. (3) 
Now let t > 0. The condition 0(p) < —t holds if and only if 

Vi, \\x\\2 — 1 : p Card(x) > t + x T YiX. 

Specializing the above condition to x = e\, we obtain that 0(p) < —t implies p > En + 1. 
Conversely, assume that p > En + 1. Using (jSJ), we have for every x, \\x\\2 = 1: 

pCard(x) > p\\x\\l > (En + t)\\x\\l > x T Ex + t, 

where we have used the fact that ||x||i > 1 whenever ||x||2 = 1. Thus we have obtained that 
4>{p) < —t with t > if and only if p > En + 1, which means that <p(p) = En — p whenever 
P > En. 

Finally, let us prove that 0(p) = when p > En. For every x ^ such that ||x||2 < 1, 
we have 

pCard(x) > - 2 ||x||i > p||x||i > x T Ex, 

\\ X \\2 

which shows that 0(p) < 0, and concludes our proof. ■ 
In the sequel, we will make the following assumption. 
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Assumption 1 We assume that p < En, that is, the setX(p) := {i : Ej, > p} is not empty. 



3 New Representations 

3.1 Largest eigenvalue maximization 

The following theorem shows that the problem of computing (p(p) can be expressed as a 
eigenvalue maximization problem, where the sparsity pattern is the decision variable. 

Theorem 2 For p e [0, £n[, 4>{p) can be expressed as the maximum eigenvalue problem 

4>(p) = max A max UiBi , (4) 
ue[o,i]» \~i ) 

where Bi := a^af — p • I m , i — 1, . . . , n. 

An optimal solution to the original problem (QJ) is obtained from a sparsity pattern vector 
u that is optimal for by finding an eigenvector y corresponding to the largest eigenvalue 
o/D(?i)SD(n), and setting x = D(u)y /\\D(u)y\\2, where D{u) := diag(u). 

Proof. Since p < En, the result of Theorem [T] implies that 0(p) is equal to 0(p) defined in 
in d2J). Let us now prove that = where 

:= max max y T D(u)Y>D(u)y — p ■ l T u. (5) 

(tG{0,l} n y T y<l 

To prove this intermediate result, first note that if x is optimal for <fr(p), that is, for (J2J), 
then we can set Ui = 1 if Xi ^ 0, Ui = otherwise, so that Card(x) = l T u; then, we set 
y = x and obtain that the pair (n, y) is feasible for 4>(p), and achieves the objective value 
0(p), hence < Conversely, if (u,y) is optimal for then x = D{u)y is feasible 

for 0(p) (as expressed in (J2J)), and satisfies Card(x) < Card(w) = l T u, thus 

4>(p) = y T D(u)^D{u)y - pl T u < x T ^x - p Card(a;) < <j>(p), 

This concludes the proof that </>(p) = 4>{p). 

We proceed by eliminating y from (j3J), as follows: 

4>{p) = max X max (D(u)T,D(u)) - p ■ l T u 
«e{o,i} n 

= max X max (D(u)A T AD(u)) - p ■ l T u 
ite{o,i} n 

= max X ms ,JAD(u)A T ) — p ■ l T u 
«e{o,i} n 



it 



= max Xm^y^UiaiaJ) - p ■ l T u, 
i=i 

in virtue of E = A T A, and D(u) 2 = D{u) for every feasible u. Invoking the convexity of the 
largest eigenvalue function, we can replace the set {0, l} n by [0, l] n in the above expression, 
and obtain (JU). ■ 
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3.2 Thresholded Rayleigh quotient 

The following theorem shows that <f>(p) can be expressed as a maximal "thresholded Rayleigh 
quotient", which for p = reduces to the ordinary Rayleigh quotient. 

Theorem 3 For p e [0, £n[, we have 

n 

<f>(p) = max V((af0 2 ~ P)+, (6) 
= max ^{aJXcit - p)+ : X y 0, TrX = 1. (7) 



i=l 



yln optimal solution x for (QJ) is obtained from an optimal solution £ to problem Jjj|) fry setting 
Ui = \ if (ctf£) 2 > p ; = otherwise; then, finding an eigenvector y corresponding to the 
largest eigenvalue of D(u)T,D(u), and setting x = D(u)y/\\D(u)y\\2. 

Proof: From the expression (J3J), we derive 

6{p) = max max £ T > UiCLiaf \ £ — p ■ l T u 
YKHJ «e[o,il" f T «<i I ^ / 



,i=i 



= max ^((of e) 2 - P e T e)+ 

n 

= max ^((af0 2 -p) + , (8) 

where the last equality derives from the fact that <f)(p) > (which is in turn the consequence 
of our assumption that En = max j afa^ > p). Finally, the equivalence between © and (JTj) 
stems from convexity of the objective function in problem (J7J), which implies that without 
loss of generality, we can impose X to be of rank one in (J7J). ■ 

The following corollary shows that we can safely remove columns and rows in S that 
have variance below the threshold p. 

Corollary 1 Without loss of generality, we can assume that every optimal solution to the 
original problem (QJj has a support included in the setX(p) := {i : E« > p}. Thus, ifEu < p, 
the corresponding column and row can be safely removed from S. 

Proof: This is a direct implication of the fact that for every i, if p > ajdi, then we have 
( a l0 2 — P f° r every £ such that £ T £ = 1. Hence, the corresponding term does not appear 
in the sum in (09). I 
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3.3 Exact solutions in some special cases 



Theorems |2] and El allows to solve exactly the problem in some special cases. 

First, Theorem El can be invoked when £ is diagonal, in which case the optimal vector x 
turns out to be simply the first unit vector, e±. 

Next, consider the case when the matrix E has rank one, that is, m — 1. Then, the a^'s 
are scalars, and the representation given in Theorem El yields 

n n 

4>(p) = max ^((a^) 2 - P)+ = ^(a 2 - p) + . 

i=l i=l 

A corresponding optimal solution for 0(p) is obtained by setting Ui = 1 if p < a 2 , ui = 
otherwise, and then setting x = 5/||a||2, with a obtained from a by thresholding a with 
absolute level ^fp. In the sequel, we assume that m > 1. 

A similar result holds when S has the form £ = I + aa T , when a is a given n-vector, 
since then the problem trivially reduces to the rank-one case. 

4 SDP relaxation 

A relaxation inspired by P is given by the following theorem. 

Theorem 4 For every p £ [0, £n[, we have <fr(p) < ip{p), where ip(p) is the solution to a 
semidefinite program: 

rl){p)'.= min A max ( ) : Y t y_B u Y, y 0, z = l,...,n. (9) 

( Y i)i=i \ i=1 J 

The problem can be represented in dual form, as the convex problem 

n 

ip{p) = max J^Tr (X 1/2 ai afX 1/2 -pX) + : X y 0, TrX = 1. (10) 

i=l 

Proof: If (Yi)f =1 is feasible for the above SDP, then for every £ £ R m , £ T £ < 1, and 
u £ [0, l] n , we have 

i=l / i=l \i=l / 

which proves <f>{p) < ^(p). The dual of the SDP (JHJ) is given by 

n 

ip(p) = max ^2(Pi,Bi) : Ih^i^O, z = l,...,n, TrX = 1. (11) 

x,(p<) "=i i=i 

Using the fact that, for any symmetric matrix B, and positive semi-definite matrix X, 
max {{P,B) : X y P y 0} = TV (X 1/2 BX 1/2 ) 



allows to represent the dual problem in the form (JTUJ). Note that the convexity of the 
representation ([TH]) is not immediately obvious. ■ 



A few comments are in order. 

The fact that 0(p) < ip(p) can also be inferred directly from the dual expression (jlOj) : we 
have, by convexity, and using the representation (|7J) for 

rl>(p) > max | ^(aJXa.-p) + : X y 0, TrX = lj = <f>(p). 

From the representation fTt)|) and this, we obtain that if the rank k of X at the optimum of 
the dual problem (jlOj) is one, then our relaxation is exact: <fi(p) = ip(p). 

In fact, problem fTT)|) can be obtained as a rank relaxation of the following exact repre- 
sentation of 0: 

n 

= max V Tr (X 1/2 a ia f X 1/2 - pX) : X y 0, TrX = 1, Rank(A) = 1. 

i=l 

In contrast, applying a direct rank relaxation to problem (JBJ) (that is, writing the problem 
in terms of letting X = ££ T and dropping the rank constraint on X) would be useless: it 
would yield (J7J), which is <fi(p) itself. 

Finally, note that our relaxation shares the property of the exact formulation (JHJ) observed 
in Corollary Q that indices i such that p > Ha can be simply ignored, since then Bi -< 0. 



5 Quality of the SDP relaxation 

In this section, we seek to estimate a lower bound on the quality of the SDP relaxation, 
which we define to be a scalar 9 G [0, 1] such that 

o^(p) < m < Hp). (12) 

Thus, (1 — 9)/ 9 is a upper bound on the relative approximation error, (ip(p) — 0(p))/0(p). 



5.1 Quality estimate as a function of the penalty parameter 

Our first result gives a bound on the relaxation quality conditional on a bound on p. We 
begin by making the following assumption: 

Assumption 2 We assume that < p < min Ejj = E nn; and m = Rank(S) > 1 . 

l<i<n 

From the result of Corollary Q we can always reduce the problem so that the above assump- 
tion holds, by removing appropriate columns and rows of E if necessary. 

Theorem 5 With assumption in force, for every value of the penalty parameter p G 
[0, E nn [ ; and for every 7 > such that 

P<^~ £11, (13) 
n + 7 

7 



the bound FUfy holds with 9 set to 9 m {pf), where for m > 1 and 7 > 0, we define 



e ki 2 



m — 



7 



.J=2 



fit 



) 



+ 



(14) 



which can be computed by the formula 




(15) 



Jo 

For every 7 > ; the value 9 m {l) decreases with m, and admits the bound 




cos 2 (t)sm m - 2 (t)dt 



2 \ 71 V m — 1 I 



(16) 



In particular, if p satisfies MS\) with 7 = 1, that is, p < £n/(n + 1) , £/ien bound hlty 
holds with 9 > I /it. 

Before we prove the theorem, let us make a few comments. 

First, as will be apparent from the proof, the value of m can be safely replaced by the 
rank k of an optimal solution to the SDP (|l()jl. This can only improve the quality estimate, 
as k < m and 9 m {^j) is a decreasing function of m for every 7 > 0. 

Second, the smaller m is, and the larger 7 is, the smaller the corresponding quality 
estimate. However, a small value for 7 does not allow for a large range of p values via ()13|) . 
and this effect is becomes more pronounced as n grows. The theorem presents the result 
in such a way that the respective contributions of m, n to the deterioration of the quality 
estimate are separated. A plot of the function 9 m for various values of m is shown in Figure^ 

Third, the theorem allows to plot the predicted quality estimate 9 as a function of the 
penalty parameter, in the interval [0,S nn [. Leveraging these results to the entire range 
[0, Sn[ will be straightforward, but will require us to be careful about the sizes n and m, as 
they change as p crosses the values E n _ 1)n _!, . . . , En, in view of Corollary [T] We formalize 
the argument in Corollary |21 

Finally, the theorem allows to derive conditions on the structure of X that guarantee a 
prescribed value of the quality. We describe such a condition in Corollary El 

Proof of theorem [5j The approach we use in our proof is inspired by that of Theorem 2.1 
in PJ. Let X >z 0, Tr X — 1, be optimal for the upper bound 4>{p) m dual form (fTU|) . so that 



where B t (X) := X^BiX 1 / 2 . Let k = Rank(X). We have seen that if k — 1, then our 
relaxation is exact: <p(p) = ip(p). If the rest of the proof, we will assume that k > 1. We 
thus have 1 < k < m = Rank(S) < n. 



n 



^(p) = ^Tr(i? i (X) + ), 



i=l 



s 
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Figure 1: Plot of function m (j), as defined in (|15|). for various values of m. 
Assume that we find a scalar 9 £ [0, 1] such that: 

n 

E^(e T J B,(x)o + > mp)) ■ m T *o, (17) 
i=i 

where £ follows the normal distribution in M m . The bound above implies that there exist a 
non-zero £ £ R m such that 

n 

X)(^(x)o + > mp)) ■ (e T ^o- 
t=i 

Thus, with Mj = 1 if t^B^X)^ > 0, ttj = otherwise, we obtain that there exist a non-zero 
£ £ R m and u £ [0, l] n such that 

With z = X l ' 2 i: 



i=l 



J2uiBi)z>(0iP(p))-(z T z). 



,i=l 



The above implies that z ^ 0, so we conclude that there exist u £ [0, l] n such that 

Amax 



. 1=1 



from which we obtain the quality estimate 9ifj(p) < <ft(p) < ip{p)- By a continuity argument, 
this result still holds if (|17|) is satisfied, but not strictly. The rest of the proof is dedicated 
to finding a scalar 6 such that the bound (JT7J) holds. 
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Fix j 6 {1, ... , n}. It is easy to show that Bi(X) has exactly one positive eigenvalue ccj, 
since assumption |21 holds. Thus cc, = TrSj(X) + . Since B^X) -< X 1//2 ajdf AT 1 / 2 , we have 
OLi = X max (Bi(X)) < afXcii. Further, Bi(X) has exactly rank k = Rank(A). Denote by 
(— the negative eigenvalues of Bi(X). We then have 



fc-i 



= Tr - Ti-SipO = oti - (of Aa, - p) < p. 

3=1 

Now let £ follow the normal distribution in IR m , J\f(0,I m ). By rotational invariance of 
the normal distribution, we have: 



fe-i 

-2 

i+i 

3=1 



Thus, 

E(£ T ^(X)£) + > min ^Eja^-^/3^1 : (3 > 0, < p 



fc-i \ fc-i 



3=1 / + 3=1 

m— 1 \ m— 1 



^ Sfe 1 E ( - E^i : £ > 0, J> < p |> (19) 

3=1 / —L 3=1 



m—l 

P 



E H-s3i^ • (20) 



3=1 



where we have exploited the convexity and symmetry in problem (|19|). (As claimed in the 
first remark made after Theorem we could safely keep k instead of m in the remaining of 
the proof.) 

Summing over i, and in view of ip{p) = YH=i a ii we § e t ; 



m—l 

P 



Ej2(fHX)0 + > E E rjE4i (by the bound ©) 

t=l i=l \ 771 j=l J + 

(m-1 \ 
■0(p)£i 2 — y^£ 2 _i_i ] (by homogeneity and convexity) 
m—l ^-^ J J 
3=1 / + 

> e m {-y)-^{p\ (21) 

provided 7 > np/(m — l)tp(p). Using the fact that ip(p) > 0(p) > En — p, we obtain that 
the bound (fT2"j) holds with 9 = #771(7) whenever ()13|) does, as claimed in the theorem. The 
expression (JT5|l of the function # m is proved in Appendix [21 while the bound (fTSJl is proved 
in Appendix El Finally, the fact that the function # m (7) decreases with m > 1 for every 
7 > is a consequence of the following representation: 

{/ m—l \ m—l 
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Indeed, adding constraints (3j = for j > k in the above problem shows that O m {pf) < 6^(7) 
for every 7 > and k < m. M 



0.1 0.2 0.3 0.4 0.5 0.6 



p/E 
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Figure 2: Plot of the function #(p) defined in Corollary [2J for a specific 5x5 
covariance matrix S. The left pane corresponds to a random matrix, and the right 
pane, to a random matrix that satisfies the conditions of Corollary El 



The following corollary allows to plot the quality estimate, as derived from Theorem as 
a function of p across the entire range [0, S n [. We do not make the assumption^ anymore, 
but do keep assumption [T] 

Corollary 2 Let p G [0,En[, and define n(p) = Card{i : Ejj > p} > and m(p) = 
Rank(S(p)), where S(p) is the n(p) x n(p) matrix obtained by removing the last n — n(p) 
rows and columns in E. The bound Mty) holds for 9 = $(p), where 

»{p) = I *™mM0), iip) = ■ ^z- p ifm( P ) > 1, 

y 1 otherwise. 
An example of the resulting plot is shown in Figure 121 

5.2 Quality estimate based on the structure of E 

The next result illustrates how to obtain a quality estimate based on structural assumptions 
on E, requiring that its ordered diagonal decreases fast enough. 

Corollary 3 Assume En > . . . > E nn . // E 2 2 < p < S n , then the bounds FTty hold with 
6 = 1, that is, <fr(p) = ^(p). // in addition, we have, for every h G {2, . . . , n} 

E^ < ^ySn, (22) 
11 



then, whenever < p < E227 the bounds hlty hold with > 1/n. 



Proof: In the case p G [S 2 2,S 11 [, n{p) = 1, so that m(p) = 1, and the bound (fT2*j) holds 
with 9 = 1. Now let p be such that < p < £22- Then there exist h E {2, . . . , n} such that 
Efc+1,/1+1 < p < ^hh, with the convention E n+ i jn+ i = 0. In this case, n(p) = Card{i : > 
p} = h, so that the sufficient condition (|T3*|) with 7 = 1 writes 

p -(^TT) Sl1 ' 



which, in view of 'Eh+i^+i < p < E^, holds when ()22|) holds, independent of p. Applying 
the bound (|16|) ends the proof. ■ 

An example corresponding to the situation of Corollary |H] is shown in Figure 121 (left pane). 
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A A Formula for 6 m 

Let 7 > 0, m > 1. Let us prove that the function 6 m defined in (fT4^) can be represented as 
in (fTo^) . Using the hyperspherical change of variables 

6 = rcos(0i) 

£2 = rsinOi) cos(0 2 ) 

£ m _i = r sin(0i) . . . sin(0 m _ 2 ) cos(> m -i) 
£ m = r sin(0i) . . . sin(0 m _ 2 ) sin(0 m _ 1 ), 

with cj)j G [0, 7r], j = l,...,m — 2, m -i G [0,27r], and with the corresponding change of 
measure 

d£ = r m ~ x sin m - 2 (0i) . . . sin(0 m _ 2 )rf0i . . . 
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we obtain 



Im ' Im\i)i 

where I m is some constant, independent of 7, and 

Jm{l) ■ = 



3=2 



e -ll«lli/2 



3s 2 (0i) 3y sin 2 (0!) J sin m - 2 (0i)^i. 



m 



Since 9 m (0) = 1, we have I m = 1/J m (0). Exploiting symmetry to reduce the integration 
interval from [0, ir] to [0, 7r/2], proves the formula (|15|1. 



B A bound on 6 



m 



The bound stems from the identity a + = (a+ |a|)/2, valid for every a G M, and the following 
result, found in the proof of Theorem 2.1 of 0: 



Wye 



E 



i=l 



7T 
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